Design of predictive model to optimize the solubility of Oxaprozin as nonsteroidal anti-inflammatory drug

In recent years, considerable effort has gone into improving the solubility and bioavailability of novel therapeutic medicines. One of the most promising approaches is the use of supercritical carbon dioxide (SC-CO2). It has found wide use in pharmaceutical processing because of advantages such as its colorless nature, cost-effectiveness, and environmental friendliness. This research project aims to predict the solubility of Oxaprozin in SC-CO2 mathematically using artificial intelligence. Oxaprozin is a nonsteroidal anti-inflammatory drug used in arthritis to relieve swelling and pain. It is a BCS (Biopharmaceutical Classification System) class II drug with low solubility and bioavailability. To optimize the solubility of Oxaprozin, three ensemble decision tree-based models are considered: random forest (RF), extremely randomized trees (ET), and gradient boosting (GB). The dataset contains 32 data vectors, with temperature and pressure as inputs and drug solubility as output. In terms of MSE, ET, RF, and GB gave errors of 6.29E−09, 9.71E−09, and 3.78E−11, respectively. In terms of R-squared, they scored 0.999, 0.984, and 0.999, respectively. GB is selected as the best-fitting model, with optimal values of 333.15 K for the temperature, 380.4 bar for the pressure, and 0.001242 (mole fraction) for the solubility.

www.nature.com/scientificreports/ Remarkable progress in the pharmaceutical industry has paved the way for novel therapeutic drugs that treat various challenging diseases 1,2. Despite this development, the poor solubility of active pharmaceutical ingredients (APIs) remains one of the most prominent limitations in drug development 3,4. Oxaprozin (C18H15NO3) is one of the commonly employed non-steroidal anti-inflammatory drugs (NSAIDs) 5,6. The analgesic and antipyretic characteristics of this propionic acid derivative make it suitable for alleviating the pain of acute and chronic disorders such as inflammation, swelling, osteoarthritis and rheumatoid arthritis 7,8. Figure 1 presents a ball-and-stick representation of Oxaprozin. This NSAID reduces the formation of prostaglandin precursors from arachidonic acid via cyclo-oxygenase inhibition, which significantly reduces pain and inflammatory responses. Oxaprozin has shown superior efficacy compared to aspirin or piroxicam in the treatment of osteoarthritis 9.
To improve the solubility of drugs, the role of solvents cannot be ignored. Supercritical fluids (SCFs) are now known as an innovative and efficient medium for particle formation. This approach can overcome some disadvantages of conventional technologies such as crushing, crystallization and precipitation 11,12. Supercritical carbon dioxide (SC-CO2) is frequently applied to fractionate valuable components in pharmaceutical processes owing to its abundance, colorless nature, cost-effectiveness, and environmentally benign character 13. Because solubility in SC-CO2 matters for the design and development of novel drugs, experimental measurement of drug solubility is of great importance 14. However, economic and operational obstacles, such as the difficulty of characterizing solute-solvent interactions in the SC-CO2 system and the high cost, have limited experimental investigation.
Therefore, mathematical models that predict the solubility of different types of drugs can be an appropriate way to reduce processing time and cost. AI has been introduced as a promising predictive tool for estimating the solubility of drugs numerically. Beyond pharmacology, AI has found a role in many areas of chemical engineering, such as extraction, purification, separation, crystallization and chemical reactor engineering 15. In most scientific fields, machine learning (ML) techniques, including regression trees, neural networks and support vector machines, are common computational procedures. These models extract a variety of relationships between inputs and outputs 16-18.
The Decision Tree (DT) is one of the most commonly used learning models. A weak model is a simple predictor that is only slightly better than a random estimator. In tree-based ensemble methods, the results of many base DT models are aggregated to form a stronger model 19,20.
Bagging and boosting are two of the most effective strategies for improving Decision Trees. Bagging (Bootstrap Aggregating), developed by Breiman 21, is one of the most basic and straightforward ensemble techniques; it performs well while reducing variance and limiting overfitting. Its base estimators are diverse because of the bootstrap approach, which resamples the training data to generate subsets. Each subset is used to fit a different base estimator, and the final prediction is obtained by majority vote (or, for regression, by averaging) 21,22.
Another ensemble method, based on the work of Freund and Schapire, is boosting 23. This approach differs from Bagging in that it generates a diverse set of base learners by progressively reweighting the training data. In each subsequent training step, a higher weight is given to every sample that the previous estimator predicted poorly. As a result, training samples with weak estimates are more likely to appear in subsequent bootstrap samples, which allows bias to be reduced effectively. In the final Boosting model, the base estimators are weighted according to their prediction performance. Random forest, Extra Trees, and Gradient Boosting models were all considered in this research 24-27.
The aim of this research was to optimize the solubility of Oxaprozin in supercritical fluid by applying different machine learning models and identifying the best of them.

Experimental
The predictive models in this research were developed from the experimental data of Khoshmaram et al. 14, who measured the solubility of Oxaprozin using a combination of static and gravimetric techniques in a pressure-volume-temperature (PVT) cell. This system can be filled with up to 0.4 L of Oxaprozin and supercritical fluid. An important advantage of the PVT cell is that the two key parameters for evaluating drug solubility, temperature and pressure, can be adjusted. In this setup, increasing the pressure produces SC-CO2 in the liquefaction unit. The condensed solvent then passes through an inline filter for purification, and the purified solvent enters a surge tank upstream of the PVT cell. The temperatures of SC-CO2 and Oxaprozin were controlled using heating elements insulated by a PTFE layer.

Data set
This study's dataset is taken from Ref. 14 and contains only 32 data vectors. Each vector has two input parameters (pressure and temperature) and one output (solubility). The dataset is shown in Table 1, and the Pearson correlations 28 between the parameters are shown in Fig. 2.
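As an illustration of the correlation analysis mentioned above, the Pearson coefficient between one input and the output can be computed directly from its definition. The sketch below is a minimal example in plain Python; the function name `pearson` is hypothetical, and the numerical values are placeholders chosen for illustration, not the measurements of Ref. 14.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the common 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for illustration only; NOT the data of Ref. 14.
pressure = [120.0, 160.0, 200.0, 240.0, 280.0]            # bar
solubility = [1.0e-4, 2.1e-4, 2.9e-4, 4.2e-4, 5.0e-4]     # mole fraction

r = pearson(pressure, solubility)
print(f"Pearson r between pressure and solubility: {r:.4f}")
```

A coefficient near +1 or −1 indicates a strong linear association between the two variables, while a value near 0 indicates little linear association.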

Methodology
Random forest and extra tree. The random forest is a tree-based ensemble learning model that, like other ensemble methods, combines multiple base tree learners to improve on their individual performance 29. First, bootstrap samples are drawn from the training data, and an unpruned regression tree is built for every bootstrapped sample. Second, at each node, instead of using all the available predictors, a specified number K of predictors is picked at random to serve as split candidates. This two-step operation is repeated until C decision trees with these characteristics have been built, at which point unseen data can be predicted by aggregating the estimates of the C trees. Random forest uses a bagging strategy to increase tree diversity by constructing DTs from different training subsets, which reduces the model's total variance 17. An RF regression predictor is expressed by the following equation:

$$\hat{f}_{RF}(x) = \frac{1}{C} \sum_{i=1}^{C} T_i(x)$$

Here, C is the number of decision trees, x is the data point, and T_i(x) is a single DT built from a bootstrap sample and a subset of the input variables. RF can also estimate the out-of-bag error natively, using the samples that were not selected during the bagging process. This subset yields an unbiased estimate of the generalization error without requiring any external data 19,30. RF also assigns importance scores to each input variable: it perturbs one input variable while holding the others constant and records the resulting average decrease in model accuracy 19.
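The bootstrap-and-average scheme described above can be sketched in a few lines. This is a deliberately reduced illustration, not the implementation used in this study: each tree is a depth-1 regression tree (a stump), K = 1 predictor is drawn at random per tree, and the data are synthetic values generated from a made-up formula. All function names are hypothetical.

```python
import random

def fit_stump(X, y, feature):
    """Depth-1 regression tree on one feature: pick the threshold with least squared error."""
    best = None
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = sum((t - mean_l) ** 2 for t in left) + sum((t - mean_r) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, mean_l, mean_r)
    if best is None:  # only one distinct value in the sample: predict the mean everywhere
        mean_y = sum(y) / len(y)
        return (feature, float("inf"), mean_y, mean_y)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, thr, mean_l, mean_r = stump
    return mean_l if row[feature] <= thr else mean_r

def fit_forest(X, y, n_trees, rng):
    """Build n_trees stumps, each on a bootstrap sample and one randomly chosen predictor."""
    forest = []
    n = len(X)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample n rows with replacement
        X_boot = [X[i] for i in idx]
        y_boot = [y[i] for i in idx]
        feature = rng.randrange(len(X[0]))           # K = 1 random split candidate
        forest.append(fit_stump(X_boot, y_boot, feature))
    return forest

def predict_forest(forest, row):
    """Ensemble prediction: the mean of the individual tree predictions."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Synthetic data (assumption): rows are (temperature in K, pressure in bar).
temperatures = [308.0, 318.0, 328.0, 338.0]
pressures = [120.0, 200.0, 280.0, 360.0]
X = [(t, p) for t in temperatures for p in pressures]
y = [1e-6 * p + 2e-6 * (t - 308.0) for t, p in X]    # made-up "solubility" values

rng = random.Random(0)
forest = fit_forest(X, y, n_trees=50, rng=rng)
low = predict_forest(forest, (318.0, 120.0))
high = predict_forest(forest, (318.0, 360.0))
print(low, high)
```

Because every stump sees a different bootstrap sample and a randomly chosen predictor, the individual trees disagree with one another, and averaging them smooths out that disagreement.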
Extra Trees (ET) is a tree-based ensemble approach similar to random forest. It strongly randomizes both the cut-point choice and the attribute choice when splitting a tree node, and it can be applied to both classification and regression tasks 31,32.
The two models are alike in that they grow multiple trees and split nodes using random subsets of features. There are, however, two major differences: ET uses randomized splits rather than optimal splits, and it uses the whole training sample rather than bootstrap samples 33.
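The difference in how the two methods choose a cut point can be made concrete with a small sketch. It assumes a single feature and a squared-error criterion; the function names and the numerical values are hypothetical placeholders, not taken from the study.

```python
import random

def split_sse(xs, ys, thr):
    """Sum of squared errors when each side of the threshold predicts its own mean."""
    total = 0.0
    left = [y for x, y in zip(xs, ys) if x <= thr]
    right = [y for x, y in zip(xs, ys) if x > thr]
    for side in (left, right):
        if side:  # an empty side contributes no error
            mean = sum(side) / len(side)
            total += sum((y - mean) ** 2 for y in side)
    return total

def optimal_threshold(xs, ys):
    """RF-style split: search all midpoints for the least squared error."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda thr: split_sse(xs, ys, thr))

def random_threshold(xs, rng):
    """ET-style split: draw the cut point uniformly between the feature's min and max."""
    return rng.uniform(min(xs), max(xs))

# Placeholder values for illustration only.
pressures = [120.0, 160.0, 200.0, 240.0, 280.0, 320.0]            # bar
solubility = [1.0e-4, 1.4e-4, 2.1e-4, 3.5e-4, 4.0e-4, 4.3e-4]     # mole fraction

rng = random.Random(0)
thr_rf = optimal_threshold(pressures, solubility)
thr_et = random_threshold(pressures, rng)
print("optimal cut:", thr_rf, "SSE:", split_sse(pressures, solubility, thr_rf))
print("random cut: ", thr_et, "SSE:", split_sse(pressures, solubility, thr_et))
```

A single random cut is never better than the optimal one on the same data, but it is cheaper to compute and makes the trees less correlated with one another.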

Gradient boosting.
Boosting is also an ensemble learning technique. Rather than relying on a single predictor, boosting combines a sequence of base predictors to improve prediction accuracy. In a stage-wise process, base estimators (decision trees here) are fitted successively to reduce bias. At each stage, a new learner is introduced to reduce the loss function. The first learner reduces the loss function to the smallest possible value using the training data 24,34,35. Each subsequent estimator is fitted to the residuals of the previous estimators. The steps of the gradient boosting method are described in detail in Refs. 24,35,36.
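The stage-wise procedure described above can be sketched as follows. This is a generic least-squares gradient boosting loop with depth-1 trees, written under stated assumptions; it is not necessarily the exact algorithm of Refs. 24,35,36 or the configuration used in this study. The data are synthetic, and all function names, the number of stages and the learning rate are hypothetical choices.

```python
def fit_stump(X, r):
    """Best depth-1 regression tree for targets r, searching every feature and midpoint."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [v for row, v in zip(X, r) if row[feature] <= thr]
            right = [v for row, v in zip(X, r) if row[feature] > thr]
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = sum((v - mean_l) ** 2 for v in left) + sum((v - mean_r) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, feature, thr, mean_l, mean_r)
    return best[1:]  # (feature, threshold, left mean, right mean)

def predict_stump(stump, row):
    feature, thr, mean_l, mean_r = stump
    return mean_l if row[feature] <= thr else mean_r

def fit_gradient_boosting(X, y, n_stages, learning_rate):
    f0 = sum(y) / len(y)              # initial constant model: the mean minimizes squared loss
    preds = [f0] * len(y)
    stumps = []
    for _ in range(n_stages):
        # For the loss 0.5 * (y - F)^2, the negative gradient is the residual y - F.
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return (f0, learning_rate, stumps)

def predict(model, row):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, row) for s in stumps)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data (assumption): rows are (temperature in K, pressure in bar).
X = [(t, p) for t in (308.0, 318.0, 328.0, 338.0) for p in (120.0, 200.0, 280.0, 360.0)]
y = [1e-6 * p + 2e-6 * (t - 308.0) for t, p in X]    # made-up "solubility" values

model = fit_gradient_boosting(X, y, n_stages=50, learning_rate=0.5)
mse_start = mse(y, [model[0]] * len(y))              # error of the constant model
mse_end = mse(y, [predict(model, row) for row in X])  # training error after boosting
print(mse_start, mse_end)
```

With a squared-error loss and a learning rate between 0 and 1, each added stump cannot increase the training error, which is why the training MSE falls as stages are added. Note that this is the training error only; it says nothing about performance on unseen data.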

Results
The hyper-parameters of the three models were tuned using a grid search. All three final models were evaluated by the R-squared and MSE criteria. Additionally, some visualizations were produced, which are discussed below. Figures 3, 4 and 5 compare the expected and predicted values. In these figures, the blue line indicates the expected values and the points indicate the predicted values (red for the test data and black for the training data). In addition, Table 2 gives quantitative metrics comparing the three implemented models with their optimal hyper-parameters. The results in Table 2 confirm that GB is the most accurate and general model (R 2 = 0.999 and MSE = 3.78E−11), and it has been used as the main model for the rest of the analysis. The simultaneous effect of temperature and pressure, the two input parameters, on solubility, the only output, is shown in 3D in Fig. 6. Furthermore, Figs. 7 and 8 show the two-dimensional trends obtained by holding each input fixed in turn. These figures are consistent with the optimal values in Table 3. The figures show that the system pressure has a positive effect on the solubility of Oxaprozin in the supercritical system. Indeed, an increase in pressure raises the solvent density, which in turn intensifies the solvating power of the SC-CO2 system. Although pressure has a direct connection with drug solubility, the effect of temperature is indirect. To evaluate the effect of temperature on drug solubility, the roles of sublimation pressure and density above and below the cross-over pressure (COP) must be analyzed. At pressures above the COP, the favorable effect of increased sublimation pressure on solubility outweighs the adverse effect of reduced density.
Therefore, at these pressures, an increase in temperature significantly enhances the solubility in the SC-CO2 system. At pressures below the COP, the adverse effect of reduced density outweighs the positive effect of sublimation pressure. Therefore, at these pressures, increasing the temperature significantly reduces the solubility in SC-CO2. Table 3 shows that a pressure of 380.4 bar and a temperature of 333.15 K are the optimum conditions for reaching the greatest Oxaprozin solubility.
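For reference, the two criteria used above to compare the models, MSE and R-squared, can be computed as in the following minimal sketch. The function names are hypothetical and the measured and predicted values are placeholders for illustration, not the results reported in Table 2.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Placeholder values for illustration only (mole fraction).
measured = [1.0e-4, 2.0e-4, 4.0e-4, 8.0e-4]
predicted = [1.1e-4, 1.9e-4, 4.2e-4, 7.9e-4]

print("MSE:", mse(measured, predicted))
print("R-squared:", r_squared(measured, predicted))
```

Because solubility values here are of the order of 1E−4 to 1E−3 in mole fraction, squared errors are naturally very small numbers; MSE values should therefore be read relative to the scale of the output, and R-squared provides a scale-free complement.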

Conclusion
Numerous efforts have been made to develop green and efficient solvents that overcome the functional and operational drawbacks of organic solvents. SC-CO2 has been introduced as a widely used solvent for fractionating valuable components and increasing the solubility of drugs in pharmaceutical processes because of its remarkable advantages (i.e., abundance, cost-effectiveness, and environmentally benign character). In this paper, several numerical models based on AI techniques were proposed to predict the optimum solubility of Oxaprozin in SC-CO2. Three ensemble decision tree-based models were used: extremely randomized trees (ET), random forest (RF), and gradient tree boosting (GB). The available data consist of 32 data vectors with two inputs, temperature and pressure, and one output, solubility. ET, RF, and GB had MSE values of 6.29E−09, 9.71E−09, and 3.78E−11, and R-squared scores of 0.999, 0.984, and 0.999, respectively. The final model chosen is GB, with the following optimal values: T = 333.15 K, P = 380.4 bar, and solubility = 0.001242 (mole fraction), which is the greatest predicted Oxaprozin solubility.